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Abstract 

We use convex relaxation techniques to provide a sequence of solutions to the matrix completion 
problem. Using the nuclear norm as a regularizer, we provide simple and very efficient algorithms for 
minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm iteratively 
replaces the missing elements with those obtained from a thresholded SVD. With warm starts this allows 
us to efficiently compute an entire regularization path of solutions. 

1 Introduction 

In many applications measured data can be represented in a matrix X^xm for which only a relatively 
small number of entries are observed. The problem is to "complete" the matrix based on the observed 
entries, and has been dubbed the matrix completion problem [CCS08, CR08, RFP07, CT09, KOM09]. The 
"Netflix" competition is a primary example, where the data is the basis for a rccommender system. The 
rows correspond to viewers and the columns to movies, with the entry Xij being the rating S {1, . . . , 5} by 
viewer i for movie j. There are 480K viewers and 18K movies, and hence 8.6 billion (8.6 x 10^) potential 
entries. However, on average each viewer rates about 200 movies, so only 1.2% or 10^ entries are observed. 
The task is to predict the ratings viewers would give for the movies they have not yet rated. 

These problems can be phrased as learning an unknown parameter (a matrix Zjnxn) with very high 
dimensionality, based on very few observations. In order for such inference to be meaningful, we assume 
that the parameter Z lies in a much low dimensional manifold. In this paper, as is relevant in many real 
life applications, we assume that Z can be well represented by a matrix of low rank, i.e. Z « VmkGkn, 
where k <^ min(n, m). In this recommender system example, low rank structure suggests that movies can 
be grouped into a small number of "genres" , with Gij the relative score for movie j in genre i. Viewer i on 
the other hand has an affinity Vu for genre i, and hence the modeled score for viewer i on movie j is the sum 
"^i^i VaGij of genre affinities times genre scores. Very recently [CR08, CT09, KOM09] showed theoretically 
that under certain assumptions on the entries of the matrix, locations and proportion of unobserved entries, 
the true underlying matrix can be recovered within very high accuracy. Typically we view the observed 
entries in X as the corresponding entries from Z contaminated with noise. 

For a matrix X„ixn let C {1, . . . , m} x {1, . . . , n} denote the indices of observed entries. We consider 
the following optimization problem: 

minimize rank(Z) 
subject to {Z^j~Xi.j f<5, (1) 

where (5 > is a regularization parameter controlling the tolerance in training error. The rank constraint 
in (1) makes the problem for general combinatorially hard [NJ03]. For a fuUy-obscrvcd X, on the other 
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hand, the solution is given by the singular value decomposition (SVD) of X. The following seemingly small 
modification to (1) 

minimize \\Z\\* 

subject to {Z,j~X,jf<5 (2) 

(i,i)en 

makes the problem convex [Faz02]. Here \\Z\\<, is the nuclear norm, or the sum of the singular values of Z. 
Under many situations the nuclear norm is an effective convex relaxation to the rank constraint as explored 
in [Faz02, CR08, CT09, RFP07]. Optimization of (2) is a semi-definite programming problem [BV04, Faz02] 
and can be solved efficiently for small problems, using modern convex optimization software like SeDuMi and 
SDPT3. However, since these algorithms are based on second order methods [LV08], the problems become 
prohibitively expensive if the dimensions of the matrix exceeds a hundred [CCS08] . In this paper we propose 
an algorithm that scales to large problems with m, n « 10'*-10^ or even larger. We obtain a rank-11 solution 
to (2) for a problem of size (5 x 10^) x (5 x 10^) and |ri| = 10^ observed entries in under 11 minutes in 
MATLAB. For the same sized matrix with = 10^ we obtain a rank-52 solution in under 80 minutes. 
[CT09, CCS08, CR08] consider the criterion 

minimize ||^||* 
subject to Zij = Xij, V(i,j) G ^2 (3) 

When (5 = 0, criterion (1) is equivalent to (3), in that it requires the training error to be zero. [CT09, CR08] 
further develop theoretical properties establishing the equivalence of the rank minimization and the nuclear 
norm minimization problems (1,3). Cai et. al. [CCS08] in their paper propose a first-order singular-value- 
thresholding algorithm scalable to large matrices for the problem (2) with (5 = 0. They comment on the 
problem (2), with (5 > 0, and suggest that it becomes prohibitive for large scale problems. Hence they 
consider the 5 > case to be unsuitable for matrix completion. 

We believe that (3) will almost always be too rigid, as it will force the procedure to overfit. If minimization 
of prediction error is our main goal, then the solution Z* will typically lie somewhere in the interior of the 
path (Figure 1), indexed by 5. 

In this paper we provide an algorithm for computing solutions of (2), on a grid of 5 values, based on warm 
restarts. The algorithm is inspired by Hastie et al.'s SVD- impute [HTS+99, TCS+01] and is very different 
the proximal forward-backward splitting method of [CCS08, CW05, SMC08], which requires the choice of a 
step size. In [SMC08], the SVD step becomes prohibitive, so some randomized algorithms are used for the 
computation. Our algorithm is very different, and by exploiting matrix structure can solve problems much 
larger than those in [SMC08]. 

Our algorithm requires the computation of a low-rank SVD of a matrix (which is not sparse) at every 
iteration. Here we crucially exploit the problem matrix structure: 

Y ^Ysp (Sparse) + Ylr (Low Rank) (4) 

In (4) Ysp has the same sparsity structure as the observed X, and Ylr has the rank r ^ m, n of the 
estimated Z . For large scale problems, we use iterative methods based on Lanczos bidiagonalization with 
partial re-orthogonalization (as in the PRO PACK algorithm [Lar98]), for computing the first few singular 
vectors/values of Y. Due to the specific structure of (4), multiplication by Y and Y' can both be done in a 
cost-efficient way.. 



2 Algorithm and Convergence analysis 
2.1 Notation 

We adopt the notation of [CCS08]. Define a matrix PniX) (with dimension n x m) 
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which is a projection of the matrix Ymxn onto the observed entries. In the same spirit, define the com- 
plementary projection Pq;{Y) via P^^iY) + Pn{Y) = Y. Using (5) we can rewrite ^ ^ij)^ 
\\Pn{Z)-Pn{X)\\l. 

2.2 Nuclear norm regularization 

We present the following lemma, given in [CCS08], which forms a basic ingredient in our algorithm. 
Lemma 1. Suppose the matrix Wmxn has rank r. The solution to the convex optimization problem 

minimize ^\\Z ~ Wfp + X\\Z\\, (6) 
z 

is given by W = Sx{W) where 

Sx{W) = UDxV' with i^A =diag[(di -A)+,...,K- A)+], (7) 
where X = UDV is the SVD ofW, D ^ diag [di,...,dr], and t+ = max(t, 0). 

The notation S\{W) refers to soft-thresholding [DJKP95]. The proof follows by looking at the sub- 
gradient of the function to be minimized, and is given in [CCS08]. 

2.3 Algorithm 

Problem (2) can be written in its equivalent Lagrangian form 

minimize \\\Pn{Z) - Pn{X)\\l + A|1^|U (8) 

Here A > is a regularization parameter controlling the nuclear norm of the minimizcr Z\ of (8) (with a 
1-1 mapping to (5 > in (2)). We now present an algorithm for computing a series of solutions to (8) using 
warm starts. Define fx{Z) = \\\Pn{Z) - Pn{X)fp + A||Z||,. 

Algorithm 1 Soft-Impute 

1. Initialize Z"'*^ = and create a decreasing grid A of values Ai > . . . > A^. 

2. For every fixed A = Ai, A2, . . . S A iterate till convergence: 

(a) Compute ^ Sa(Po(^) + Pii{Z°''')) 

(b) If "^'^^^/TfroMg"^"" < e, go to step 2e. 

(c) Assign ^ and go to step 2b. 

(d) Assign Zx <~ and Z"'^ ^ Z"™ 

3. Output the sequence of solutions Zx-^ , • . • , Zx,^ ■ 



The algorithm repeatedly replaces the missing entries with the current guess, and then updates the guess 
by solving (8). Figure 1 shows some examples of solutions using Algorithm 1 (blue curves). We see test and 
training error in the left two columns as a function of the nuclear norm, obtained from a grid of values A. 
These error curves show a smooth and very competitive performance. 
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2.4 Convergence analysis 

In this section we prove that Algorithm 1 converges to the solution to (2). 
For an arbitrary matrix Z, define 

Qx{Z\Z) = \\\Pn{X) + P,\{Z) - Z\\l + \\\Z\U, (9) 

a surrogate of the objective function fx{z). Note that f\{Z) = Q\{Z\Z) for any Z. 
Lemma 2. For every fixed A > 0, define a sequence Z\ by 

Z^+i = argmmQ;,(Z|Z^), (10) 

with Z^ = 0. The sequence Z\ satisfies 

hizt"') < Qx{z^^'\z',) < h{zl) (11) 



Proof. 



fxiZl) = '^\\PniX) + P,iiZ^,)-Z',\\l + X\\Z^,\U 

> imn{\\Pn{X) + P^(Z^-) - Z\\% + X\\Z\U} 

= Qx{Z^+'\Z',) 

= ^\\{Pn{X) ~ Pn{Z^+')} +{P^{Z',)-P,i{Z^+')}\\l + X\\Zl+'\U 

= i {\\PniX) PniZ^-^'m + WPniZ'x) (^a^')}^} + M\Z^^'\ 



> i||Po(x)-p^.(^r^)iii^+Aiizrii 



Q,{Z^+'\Z^+') 



□ 



Lemma 3. The nuclear norm shrinkage operator Sa(-) satisfies the following for any Wi, W2 (with matching 
dimensions) 

\\Sx{W,) - Sx{W2)\\j, < \\Wi - W2\\l (12) 

Proof. We omit the proof here for the sake of brevity. The details work out by expanding the operator Sa(-) 
in terms of the singular value decomposition of Wi and W2 ■ Then we use trace inequalities for the product of 
two matrices [Las95] where one is real symmetric, the other arbitrary. A proof of this Lemma also appears 
in [SMC08], though the method is different from ours. □ 

Lemma 4. Suppose the sequence Z^ obtained from (10) converges to Z'^ . Then is a stationary point 
of.fx{Z). 

Proof. The sub-gradients of the nuclear norm ||^||, are given by [CCS08] 

d\\Z\\, = {UV + W : Wraxn. U'W = 0, WV = 0, \\W\\2 < 1} (13) 

where Z = UDV is the SVD of Z. Since minimizes Qa(^|^a"^), it satisfies: 

e -iPniX) + PiiZ^') - Zl) + d\\Zl\\, Vfc (14) 



Since Z^ ^ Zf , 



(PniX) + P,UZ^-') Zl) ^ (PniX) Pn{ZT))- (15) 
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For every k, a sub-gradient p{Z^) G corresponds to a tuple {uk,Vk,Wk)- Then (passing on to a 

subsequence if necessary), (uk^VkjUUk) {uoo,Voo,Woo) and this hmit corresponds to p{Z^) g 
Hence, from (14, 15), passing on to the Umits 

0^{Pn{X)-Pn{Z^)) + d\\zn* (16) 

This proves the stationarity of the hmit . □ 
Theorem 1. The sequence Z^ defined in Lemma 2 converges to Z'^ which solves 

min^jP^iZ) - PniX)\\l. + X\\Z\U (17) 

Proof. Firstly observe that the sequence Z^ is bounded; for it to converge it must have a unique accumulation 
point. 



Observe that 



II^a^'-^aIII = \\S,{Pn{X) + P^(Z',))~S,{Pn{X) + P^{Z^~'m% 

(by Lemma 3) < || {PniX) + P,i{Z'^)) - {Pn{X) + P,i{Z^-')) \\% 

= \\p^{z^,-z^-')\\l 

< \\Zl-Zl-'\\l (18) 



Due to boundedness, every infinite subsequence of Z^ has a further subsequence that converges. If the 
sequence Z^ has two distinct limit points then for infinitely many k' > 0, \\Z^ ~ Z\ ~^\\f > e, for some e > 0. 
Using (18) this contradicts the convergence of any subsequence of Z\. Hence the sequence Z^' converges. 
Using Lemma 4, the limit is a stationary point of f\{Z) and hence its minimizer. □ 



3 From soft to hard-thresholding 

The nuclear norm behaves like a norm, and can be viewed as a soft approximation of the £o norm or 
rank of a matrix. In penalized linear regression for example, the £i norm or LASSO [Tib96] is widely used 
as a convex surrogate for the penalty or best-subset selection. The LASSO performs very well on a 
wide variety of situations in producing a parsimonious model with good prediction error. However, if the 
underlying model is very sparse, then the LASSO with its uniform shrinkage can overestimate the number of 
non-zero coefhcients. In such situations concave penalized regressions are gaining popularity as a surrogate 
to £o- By analogy for matrices, it makes sense to go beyond the nuclear norm minimization problem to more 
aggressive penalties bridging the gap between and ^Q. We propose minimizing 

/p,a(Z) = \\\Pn{Z)-Pn{X)\\l + \Y,p{HZ)-n) (19) 

j 

where p{\t\]^) is concave in \t\. The parameter 7 S [7inf,7sup] controls the degree of concavity, with 
p(|^|;7inf) = 1^1 (^1 penalty), on one end and p{\t\;^sn-p) = (^0 penalty) on the other. In particular 
for the £0 penalty denote fp.\{Z) by fH,x{Z) for "hard" thresholding. Sec [FriOS, FLOl, Zha07] for examples 
of such penalties. 

Criterion (19) is no longer convex and hence becomes more difficult. It can be shown that Algorithm 1 
can be modified in a suitable fashion for the penalty p(-; 7). This algorithm also has guaranteed convergence 
properties. The details of these arguments and statistical properties will be studied in a longer version of 
this paper. As a concrete example, we present here some features of the ^0 norm regularization on singular 
values. 

The version of (6) for the £0 norm is 

um,\\\Z-W\\l + \\\Z\W (20) 
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The solution is given by a reduced-rank SVD of W; for every A there is a corresponding q = q{X) number 
of singular- values to be retained in the SVD decomposition. As in (7), the thresholding operator resulting 
from (20) is 

S^{W)^UDyV' where = diag (di, . . . , dg, 0, . . . , 0) (21) 

Similar to Soft-Impute (Algorithm 1), the algorithm Hard-Impute for the £o penalty is given by 
Algorithm 2. 

Algorithm 2 Hard-Impute 

1. Create a decreasing grid A of values Ai > . . . > Aa'- Initialize k = 1, . . . ,K (see Section 3.1). 

2. For every fixed A = Ai, A2, . . . G A iterate till convergence: 

(a) Initialize Z°^<^ ^ Zx- 

(b) Compute Z"™ ^ Sf (Pn(A) + P^{Z°^'^)) 

(c) If ''^'^^7/7(g°^')ir'^''" < ^' g° *° «*°P 

(d) Assign Z"'^ ^ Z"""^ and go to step 2b. 

(e) Assign Zh^x ^ Z"°*- 

3. Output the sequence of solutions Zh.Xi , ■ ■ ■ , Z\j^ . 



3.1 Post-processing and Initialization 

Because the £1 norm regularizes by shrinking the singular values, the number of singular values retained 
(through cross-validation, say) may exceed the actual rank of the matrix. In such cases it is reasonable to 
undo the shrinkage of the chosen models, which might permit a lower-rank solution. 

If Zx is the solution to (8), then its post-processed version Z^ obtained by "unshrinking" the cigen- values 
of the matrix Zx is obtained by 

'•a 

a = argmin ||Pjj(A) - ^ a,Fn(^i.v-) II' (22) 

ai>0.. i—l....,rx 

Zl = UD^V, 

where Da ~ diag(ai, . . . , a^A)- Here rx is the rank of Zx and Zx = UDxV is its SVD. The estimation in 
(22) can be done via ordinary least squares, which is feasible because of the sparsity of PniuiV^) and that 
rx is small. ^ If the least squares solutions a do not meet the positivity constraints, then the negative sign 
can be absorbed into the corresponding singular vector. 

In many simulated examples we have observed that this post-processing step gives a good estimate of the 
underlying true rank of the matrix (based on prediction error). Since fixed points of Algorithm 2 correspond 
to local minima of the function (19), well-chosen warm starts Zx are helpful. A reasonable prescription for 
warms-starts is the nuclear norm solution via (Soft-Impute), or the post processed version (22). The latter 
appears to significantly speed up convergence for Hard-Impute. 

3.2 Computation 

The computationally demanding part of Algorithms 1 and 2 is in Sx{Pn{X) + P^{Z^)) or S^{Pn{X) + 
-^ni^H x))- These require calculating a low- rank SVD of the matrices of interest, since the underlying 

^Observe that the Pfi{uiv'^), i = 1, . . . ,rx are not orthogonal, though the Uiv'^ are. 



6 



model assumption is that rank(Z) <C min{m, n}. In Algorithm 1, for fixed A, the entire sequence of matrices 
have explicit low-rank representations of the form UkDkVl corresponding to Sx{P{i{X) + Pq{Z^~^)) 
In addition, observe that Pfi{X) + P^'^iZ^) can be rewritten as 

Po(X) + Psi(Zj:) = {Pjj(X)-Pj^(Z^')} (Sparse) + (LowRank) (23) 

In the numerical linear algebra literature, there are very efficient direct matrix factorization methods for 
calculating the SVD of matrices of moderate size (at most a few thousand). When the matrix is sparse, 
larger problems can be solved but the computational cost depends heavily upon the sparsity structure 
of the matrix. In general however, for large matrices one has to resort to indirect iterative methods for 
calculating the leading singular vectors/ values of a matrix. There is a lot research in the numerical linear 
algebra for developing sophisticated algorithms for this purpose. In this paper we will use the PROPACK 
algorithm [Lar, Lar98] because of its low storage requirements, effective flop count and its well documented 
MATLAB version. The algorithm for calculating the truncated SVD for a matrix W (say) , becomes efficient 
if multiplication operations Wbi and W'b2 (with &i G 62 G 5ft™) can be done with minimal cost. 

Our algorithms Soft-Impute and Hard-Impute both require repeated computation of a truncated 
SVD for a matrix W with structure as in (23). Note that in (23) the term Pn{Z^) can be computed in 
0(|r2|r) flops using only the required outer products. 

The cost of computing the truncated SVD will depend upon the cost in the operations Wbi and 
(which are equal). For the sparse part these multiplications cost 0(|fi|). Although it costs 0(|r2|r) to create 
the matrix Pfi{Z^)), this is used for each of the r such multiplications (which also cost 0{\il\r)), so we need 
not include that cost here. The LowRank part costs 0{{m + n)r) for the multiplication by bi. Hence the 
cost is 0(|ri|) -t- 0((m -I- n)r) per multiplication, cost. 

For the reconstruction problem to be theoretically meaningful in the sense of [CT09], we require that 
f2| « nrpoly(logri). Hence introducing the LowRank part does not add any further complexity in the 
multiplication by W and W' . So the dominant cost in calculating the truncated SVD in our algorithm is 
0(|f2|). The SVT algorithm [CCS08] for exact matrix completion (3) involves calculating the SVD of a sparse 
matrix with cost 0(|il|). This implies that the computational cost of our algorithm and that of [CCS08] 
is the same. Since the true rank of the matrix r <C min{m, n}, the computational cost of evaluating the 
truncated SVD (with rank « r) is linear in matrix dimensions. This justifies the large-scale computational 
feasibility of our algorithm. 

The PROPACK package does not allow one to request (and hence compute) only the singular values 
larger than a threshold A — one has to specify the number in advance. So once all the computed singular 
values fall above the current threshold A, our algorithm increases the number to be computed until the 
smallest is smaller than A. In large scale problems, we put an absolute limit on the maximum number. 

4 Simulation Studies 

In this section we study the training and test errors achieved by the estimated matrix by our proposed 
algorithms and those by [CCS08, KOM09]. The Reconstruction algorithm (Rcon) described in [KOM09] 
considers criterion (1) (in presence of noise). For every fixed rank r it uses a bi-convex algorithm on 
a Grassmanian Manifold for computing a rank-r approximation USV (not the SVD). It uses a suitable 
starting point obtained by performing a sparse SVD on a clean version of the observed matrix Pii{X). To 
summarize, we look at the performance of the following methods: 

• (a) Soft-Impute (algorithm 1); (b) Post-processing on the output of Algorithm 1, (c) Hard-Impute 

(Algorithm 2) starting with the output of (b). 

• SVT algorithm by [CCS08] 

• Rcon reconstruction algorithm by [KOM09] 
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(m, n) 


\n\ 


true rank (r) 


SNR 


effective rank (f) 


# Iters 


timc(s) 


(3 X 10^104) 


10^ 


15 


1 


(13,47,80) 


(3,3,3) 


(41.9,124.7, 305.8) 


(5 X W,5 X 10^) 


10^ 


15 


1 


8 


80 


237 


(10^105) 


10^ 


15 


10 


(5,14,32,62) 


(3,3,3,3) 


(37,74.5,199.8,653) 


(105,10^) 


10^ 


15 


10 


(18,80) 


(3,3) 


(202,1840) 


(5 X 10^5 X 10^) 


10* 


15 


10 


11 


3 


628.14 


(5 X 10^5 X 10^) 


10^ 


15 


1 


(3,11,52) 


(3,3,3) 


(341.9,823.4,4810.75) 



Table 1: Performance of the Soft-Impute on different problem instances. 



In all our simulation studies we took the underlying model as ^rnxn — ^rnxr^rxn ^~ noiscj whcrc U and 
V are random matrices with standard normal Gaussian entries, and noise is iid Gaussian, fl is uniformly 
random over the indices of the matrix with pVo percent of missing entries. These are the models under which 
the coherence conditions hold true for the matrix completion problem to be meaningful as pointed out in 
[CT09, KOM09]. The signal to noise ratio for the model and the test-error (standardized) are defined as 



SNR = \ / 7^ ^; testerror = " \ , — 24) 

Yvar(noise)' \\Pn{UV')\\l ^ ' 

In Figure 1, results corresponding to the training and test errors are shown for all algorithms mentioned 
above — nuclear norm (left two panels) and rank (right two panels) — in three problem instances. Since 
Rcon only uses rank, it is excluded from the left panels. In all examples (m, n) ~ (100, 100). SNR, true rank 
and percentage of missing entries are indicated in the figures. There is a unique correspondence between A 
and nuclear norm. The plots vs the rank indicate how effective the nuclear norm is as a rank approximation 
— that is whether it recovers the true rank while minimizing prediction error. We summarize our findings 
in the caption of the figure. 

In addition we performed some large scale simulations in Table 1 for our algorithm in different problem 
sizes. The problem dimensions, SNR, number of iterations till convergence and time in seconds are reported. 
All computations are done in MATLAB and the MATLAB version of PROPACK is used. 
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Figure 1: LI: solution for Soft-Impute; Ll-U: Post processing after Soft-Impute; Ll-LO Hard-Impute 
appUed to Ll-U; C : SVT algorithm; M: Recon algorithm. Soft-Impute performs well in the presence 
of noise (top and middle panel). When the noise is low, Hard-Impute can improve its performance. The 
post-processed version tends to get the correct rank in many situations as in Types b,c. In Type b, the post- 
processed version does better than the rest in prediction error. In all the situations SVT algorithm does 
very poorly in prediction error, confirming our claim that (3) causes overfitting. Recon predicts poorly as 
well apart from Typc-c, where it gets better error than Soft-Impute. However Hard-Impute and Recon 
have the same performance there. 
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